133 research outputs found

    Towards a Statistical Methodology to Evaluate Program Speedups and their Optimisation Techniques

    For several decades, the community of program optimisation and analysis, code performance evaluation, parallelisation and optimising compilation has published hundreds of research and engineering articles in major conferences and journals. These articles study efficient algorithms, strategies and techniques to accelerate program execution times, or to optimise other performance metrics (MIPS, code size, energy/power, MFLOPS, etc.). Many speedups are published, but nobody is able to reproduce them exactly. The non-reproducibility of our research results is a dark spot in the field, and we cannot call ourselves computer scientists if we do not provide a rigorous experimental methodology. This article is a first effort towards a correct statistical protocol for measuring and analysing speedups. As we will see, some common mistakes appear in published articles, which explains part of the non-reproducibility of the results. The present article is not sufficient on its own to deliver a complete experimental methodology; further effort is needed from the community to agree on a common protocol for future experiments. In any case, our community should pay attention to the reproducibility of its results in the future. Comment: 12 pages.
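
    As an illustration of the kind of protocol the article calls for, the following sketch compares two samples of repeated execution times and tests whether the observed median speedup is statistically significant. It is a minimal sketch, not the article's actual protocol: the timing data, repetition count and risk level are placeholder assumptions, and the Wilcoxon-Mann-Whitney test used here (via scipy) is one common choice for non-normal timing distributions.

        # Minimal sketch: is a measured median speedup statistically significant?
        # Assumes two independent samples of wall-clock times (in seconds) for
        # the same program before and after an optimisation; data is illustrative.
        from statistics import median
        from scipy.stats import mannwhitneyu

        times_base = [4.10, 4.25, 4.08, 4.31, 4.12, 4.40, 4.09, 4.22]
        times_opt  = [3.60, 3.75, 3.58, 3.90, 3.66, 3.71, 3.62, 3.80]

        speedup = median(times_base) / median(times_opt)

        # Non-parametric test: execution times are rarely normally distributed,
        # so compare the two samples without a normality assumption.
        stat, p_value = mannwhitneyu(times_base, times_opt, alternative="greater")

        alpha = 0.05  # placeholder risk level
        verdict = "significant" if p_value < alpha else "not significant"
        print(f"median speedup {speedup:.2f} is {verdict} (p = {p_value:.4f})")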

    Cyclic Task Scheduling with Storage Requirement Minimization under Specific Architectural Constraints: Case of Buffers and Rotating Storage Facilities

    This is a continuation of the SIRA work (Sid-Ahmed-Ali Touati and Christine Eisenbeis. Early Periodic Register Allocation on ILP Processors. Parallel Processing Letters, Vol. 14, No. 2, June 2004. World Scientific.), which we extend with new heuristics and experimental results. In this report, we study the exact and an approximate formulation of the general problem of one-dimensional periodic task scheduling under storage requirements, irrespective of machine constraints. We rely on the SIRA theoretical framework, which allows an optimisation of the periodic storage requirement [Touati:PPL:04]. SIRA is based on inserting storage dependence arcs (storage reuse arcs) labelled with reuse distances directly into the data dependence graph. In this new graph, we are able to bound the storage requirement, measured as the exact number of necessary storage locations. The determination of storage and reuse distances is parametrised by the desired minimal scheduling period (respectively, maximal execution throughput) as well as by the storage requirement constraints: either can be minimised while the other is bounded, or both can be bounded [siralina07, RR-INRIA-HAL-00436348]. This report recalls our fundamental results on this problem and proposes new experimental heuristics. We show in particular how to deal with specific storage architectural constraints such as buffers and rotating storage facilities.
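
    To make the reuse-arc idea concrete, here is a small illustrative sketch in Python (the toy graph and names are ours, not the report's): each reuse arc states that a statement recycles, a given number of iterations later, the storage location written by another statement, and the periodic storage requirement is the sum of the reuse distances along the reuse circuits.

        # Illustrative sketch of SIRA-style reuse arcs on a toy loop body.
        # Only the counting rule (storage requirement = sum of the reuse
        # distances) follows the framework; the data below is invented.

        # Statements that each write one storage location per iteration.
        producers = ["a", "b", "c"]

        # Reuse decision: statement s reuses, distance[s] iterations later,
        # the location written by reuse_of[s]. Here a -> b -> c -> a forms
        # a single reuse circuit covering all producers.
        reuse_of = {"b": "a", "c": "b", "a": "c"}
        distance = {"b": 2, "c": 1, "a": 2}

        # Each reuse circuit consumes exactly the sum of its distances
        # in storage locations.
        storage_requirement = sum(distance.values())
        print("periodic storage requirement:", storage_requirement)  # -> 5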

    Optimal acyclic fine-grain scheduling with cache effects for embedded and real time systems

    To sustain the increases in processor performance, embedded and real-time systems need the best total schedule time when compiling their applications. Optimal acyclic scheduling is a classical challenge that has been formulated using integer programming in many works. In this paper, we give a new formulation of the acyclic instruction scheduling problem under register and resource constraints in multiple-issue processors with cache effects. Given a directed acyclic graph G=(V,E), the complexity of our integer linear programming model is bounded by O(|V|^2) variables and O(|E|+|V|^2) constraints. This complexity is better than that of existing techniques, whose model sizes include a factor of the worst-case total schedule time.
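
    The flavour of such a time-indexed formulation can be sketched with a generic ILP front end. The model below is a simplified stand-in, not the paper's formulation: it schedules an invented four-instruction DAG on a single-issue machine, with binary variables x[i, t] (one per instruction and time slot) and one precedence constraint per arc, and minimises the makespan. The toy graph, latencies and horizon T are assumptions; PuLP is used only for convenience.

        # Simplified time-indexed ILP for acyclic instruction scheduling
        # (toy stand-in, not the paper's model). Requires: pip install pulp
        from pulp import LpProblem, LpMinimize, LpVariable, lpSum, LpBinary, LpInteger

        V = ["i0", "i1", "i2", "i3"]                                  # instructions
        E = [("i0", "i1", 2), ("i0", "i2", 1), ("i1", "i3", 1), ("i2", "i3", 2)]
        T = 10                                                        # horizon (assumed)

        prob = LpProblem("acyclic_scheduling", LpMinimize)
        x = {(i, t): LpVariable(f"x_{i}_{t}", cat=LpBinary) for i in V for t in range(T)}
        makespan = LpVariable("makespan", lowBound=0, cat=LpInteger)
        prob += makespan                       # objective: minimise the makespan

        for i in V:                            # each instruction issued exactly once
            prob += lpSum(x[i, t] for t in range(T)) == 1
        for t in range(T):                     # single-issue resource constraint
            prob += lpSum(x[i, t] for i in V) <= 1
        for u, v, lat in E:                    # dependences: start(v) >= start(u) + lat
            prob += (lpSum(t * x[v, t] for t in range(T))
                     >= lpSum(t * x[u, t] for t in range(T)) + lat)
        for i in V:                            # makespan bounds every issue time
            prob += makespan >= lpSum(t * x[i, t] for t in range(T))

        prob.solve()
        print({i: next(t for t in range(T) if x[i, t].value() > 0.5) for i in V})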

    Maximizing for Reducing the Register Need in Acyclic Schedules

    March 20th, 2001, Germany.

    Register Saturation in Superscalar and VLIW Codes

    Register constraints can be taken into account during the scheduling phase of an acyclic data dependence graph (DAG): any schedule must then minimize the register requirement. In this work, we mathematically study and extend the approach consisting of computing the exact upper bound of the register need over all valid schedules, independently of the functional unit constraints. A previous work (URSA) was presented in [5,4]. Its aim was to add serial arcs to the original DAG so that the worst-case register need does not exceed the number of available registers. We give an appropriate mathematical formalism for this problem and extend the DAG model to take into account delayed reads from and writes into registers, with multiple register types. This formulation allows us to provide better, nearly optimal heuristics and strategies, and we prove that the URSA technique is not sufficient to compute the maximal register requirement, even when its solution is optimal.
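
    The register need of one given schedule (the quantity whose maximum over all valid schedules is the register saturation) can be computed by counting simultaneously alive values. The sketch below does this for an invented DAG and schedule; computing the saturation itself would mean maximising this count over all schedules, which is what the heuristics discussed in the paper approximate.

        # Register need (MAXLIVE) of one fixed schedule of a toy DAG.
        # A value produced by u is alive from its definition time until its
        # last scheduled consumer reads it. Graph and schedule are invented.
        consumers = {"a": ["c", "d"], "b": ["d"], "c": ["e"], "d": ["e"], "e": []}
        start = {"a": 0, "b": 0, "c": 1, "d": 2, "e": 4}   # issue cycle of each node

        live_at = {}
        for u, uses in consumers.items():
            if not uses:                       # final result: no consumer recorded
                continue
            death = max(start[v] for v in uses)
            for t in range(start[u], death + 1):
                live_at.setdefault(t, set()).add(u)

        register_need = max(len(values) for values in live_at.values())
        print("register need of this schedule:", register_need)   # -> 4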

    Understanding microarchitectural effects on the performance of parallel applications

    In recent years, classical multiprocessor systems have evolved into multicores, which tightly integrate multiple CPU cores on a single die or package. This technological shift leads to sharing of microarchitectural resources between the individual cores, which has direct implications for the performance of parallel applications and consequently makes understanding and tuning them significantly harder, on top of the already complex issues of parallel programming. In this work, we empirically analyze various microarchitectural effects on the performance of parallel applications, through repeatable experiments. We show their importance, beyond the effects described by Amdahl's law and by synchronization or communication considerations. In addition to the classification of shared resources into storage and bandwidth resources by Abel et al. [1], we view the physical temperature and power budget as shared resources as well. Dynamic Voltage and Frequency Scaling (DVFS) over a wide range is needed to meet these constraints in multicores, making it a very important factor for performance today. Our work aims at a better understanding of performance-limiting factors in high-performance multicores; it shall serve as a basis for avoiding them and for finding solutions to tune parallel applications.
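
    For reference, the baseline the abstract contrasts against is Amdahl's law: if a fraction f of the sequential execution time is parallelisable over p cores, the ideal speedup is bounded as below (a standard formula stated here for context; the shared-resource and DVFS effects studied in the paper push real speedups below this bound).

        % Amdahl's law: upper bound on the speedup with p cores when a
        % fraction f of the sequential execution time is parallelisable.
        S(p) = \frac{1}{(1 - f) + \dfrac{f}{p}},
        \qquad \lim_{p \to \infty} S(p) = \frac{1}{1 - f}.
        % Example: f = 0.9 gives S(8) = 1/(0.1 + 0.9/8) \approx 4.7,
        % well below the core count, before any shared-resource effects.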

    Cyclic Register Pressure and Allocation for Modulo Scheduled Loops

    In a previous work, we introduced the notion of register saturation of a directed acyclic graph (basic block), which is the maximal number of registers needed to complete a computation on a multiple-issue processor. In this report, we extend our work to the cyclic case: given the data dependence graph (DDG) of a simple loop, the cyclic register saturation is the maximal number of registers needed by any modulo schedule of this DDG. If the register saturation is lower than the number of available registers R, then we can build a software pipelined schedule without including the register constraints: we are sure that no schedule needs more than R registers, and so no spill code has to be generated. If not, we add arcs to the DDG such that any modulo schedule requires no more than R registers, while minimizing the critical circuit. Next, we introduce and study the dual notion: the cyclic register sufficiency is the minimal number of registers needed by any modulo schedule of the DDG. If this factor is greater than R, spill code cannot be avoided. Finally, we show how to construct a cyclic register allocation on the data dependence graph independently of any software pipelined schedule. We insert anti-dependences into the original DDG to express the register reuse relations between statements, such that the number of consumed registers does not exceed R under a fixed execution rate (initiation interval).
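
    The cyclic analogue of the register-need count can be illustrated on one modulo schedule: with initiation interval II, iteration k issues each statement II cycles after iteration k-1, so a value alive for more than II cycles overlaps copies of itself from neighbouring iterations. The toy schedule and lifetimes below are invented and only show the counting.

        # Cyclic register pressure of one toy modulo schedule (invented data).
        # A value alive for span[u] cycles occupies one register at each
        # slot (start[u] + c) % II; lifetimes longer than II overlap
        # instances of the same value from other iterations.
        II = 3
        start = {"a": 0, "b": 1, "c": 2}   # issue cycle within one iteration
        span  = {"a": 4, "b": 2, "c": 5}   # lifetime in cycles of each value

        pressure = [0] * II
        for u in start:
            for c in range(span[u]):
                pressure[(start[u] + c) % II] += 1

        print("pressure per modulo slot:", pressure)             # -> [4, 3, 4]
        print("cyclic register need (MAXLIVE):", max(pressure))  # -> 4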

    SIRA: Schedule Independent Register Allocation for Software Pipelining

    Register allocation in loops is generally carried out after or during the software pipelining process, because performing register allocation first, without assuming a schedule, lacks the information about interferences between value live ranges: the register allocator would introduce extra false dependencies which dramatically reduce the original ILP (Instruction Level Parallelism). In this paper, we give a new formulation that carries out register allocation before the scheduling process, directly on the data dependence graph, by inserting anti-dependence arcs (reuse edges). This graph extension is constrained, first, by minimizing the critical cycle, and hence the ILP loss due to register pressure; second, by ensuring that a cyclic register allocation always exists with the set of available registers, for any software pipelining of the new graph. We give an exact formulation of this problem with integer linear programming.
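
    The quantity being protected when those arcs are inserted, the critical cycle, can be computed on small graphs by direct enumeration: it is the maximum over all circuits of total latency divided by total distance, and it lower-bounds the initiation interval. The sketch below uses an invented DDG and networkx's cycle enumeration; a production implementation would use a parametric longest-path algorithm instead.

        # Critical cycle of a toy loop DDG: max over circuits of
        # sum(latencies) / sum(distances); lower-bounds the initiation
        # interval. Requires: pip install networkx. Graph data is invented.
        import networkx as nx

        G = nx.DiGraph()
        # Arc (u, v): v depends on u, 'distance' iterations later,
        # with 'latency' cycles between them.
        G.add_edge("a", "b", latency=2, distance=0)
        G.add_edge("b", "c", latency=3, distance=0)
        G.add_edge("c", "a", latency=1, distance=2)  # loop-carried dependence
        G.add_edge("b", "a", latency=1, distance=1)  # e.g. an inserted reuse arc

        def cycle_ratio(cycle):
            arcs = list(zip(cycle, cycle[1:] + cycle[:1]))
            lat = sum(G[u][v]["latency"] for u, v in arcs)
            dist = sum(G[u][v]["distance"] for u, v in arcs)
            return lat / dist

        critical = max(cycle_ratio(c) for c in nx.simple_cycles(G))
        print("critical cycle (minimum II):", critical)   # -> 3.0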

    FADAlib: an open source C++ library for fuzzy array dataflow analysis

    Ubiquitous multicore architectures require that many levels of parallelism be found in codes. Dependence analysis is the main approach in compilers for the detection of parallelism. It enables vectorisation and automatic parallelisation, among many other optimising transformations, and is therefore of crucial importance for optimising compilers. This paper presents new open source software, FADAlib, performing instance-wise dataflow analysis for scalar and array references. The software is a C++ implementation of the Fuzzy Array Dataflow Analysis (FADA) method. This method can be applied to codes with irregular control such as while-loops, if-then-else constructs or non-regular array accesses, and computes exact instance-wise dataflow analysis on regular codes. As far as we know, FADAlib is the first released open source C++ implementation of instance-wise dataflow dependence analysis handling such large classes of programs. In addition, the library is technically independent of any particular compiler and can be plugged into many of them; this article shows an example of a successful integration inside gcc/GRAPHITE. We give details concerning the library implementation and then report initial results with gcc, as well as possible uses for trace scheduling on irregular codes.
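
    The kind of irregular code FADA targets fits in a few lines. In the invented fragment below, the write to a[i] is guarded by a data-dependent test, so the source of the later read of a[j] cannot be resolved exactly at compile time; a fuzzy analysis keeps the set of possible sources under a parametric condition instead of failing.

        # Invented example of irregular control: which write produces each read?
        def kernel(a, cond, n):
            for i in range(n):
                if cond[i]:          # data-dependent guard, unknown statically
                    a[i] = 2 * i     # write instance W(i), executed only if cond[i]
            s = 0
            for j in range(n):
                s += a[j]            # read R(j): its source is W(j) if cond[j]
                                     # held, else the value of a[j] before the loop
            return s

        # An exact instance-wise analysis cannot name a single producer for
        # R(j); a fuzzy analysis (as in FADA) returns both candidates, guarded
        # by the unknown predicate cond[j].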

    Load-Store Optimization For Software Pipelining

    Special issue on the interaction between compilers and computer architectures. Software pipelining can generate efficient schedules for loops by overlapping the execution of operations from different iterations, in order to exploit maximum Instruction Level Parallelism (ILP). Code optimization can decrease the total number of calculations and memory-related operations, so that instruction schedules can use the freed resources to build shorter schedules. In particular, when data is not present in the cache, performance is significantly degraded by memory references; eliminating redundant load-store operations is therefore most important for improving overall performance. This paper introduces a method for integrating software pipelining with load-store elimination techniques, and we demonstrate that the integrated algorithm is more effective than other methods.
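
    A minimal example of the redundancy being targeted: in the loop below, each iteration loads a[i+1], and the next iteration reloads the same element as a[i]. Carrying the value in a scalar (a register after compilation) removes one load per iteration. This is an illustrative Python rendering of the classical transformation, not the paper's integrated algorithm.

        # Before: two loads from 'a' per iteration; the a[i+1] of iteration i
        # is reloaded as a[i] by iteration i+1.
        def smooth(a, b, n):
            for i in range(n - 1):
                b[i] = a[i] + a[i + 1]

        # After scalar replacement: the reused element travels in a scalar
        # (a register once compiled), leaving one load per iteration.
        def smooth_opt(a, b, n):
            if n < 2:
                return               # nothing to do
            cur = a[0]
            for i in range(n - 1):
                nxt = a[i + 1]       # the only remaining load in the loop body
                b[i] = cur + nxt
                cur = nxt            # rotate: next iteration's a[i] is this a[i+1]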